March 31, 2019

Task overview

BUSINESS QUESTION: Which 5 products will be the most profitable for the company?

What data do we have?

New product attributes and existing product attributes.

  • Predicting sales of four different product types: PCs, laptops, netbooks and smartphones
  • Assessing the impact that service reviews and customer reviews have on the sales of different product types

Index

  1. Data cleaning

  2. Data exploration

  3. Pre-process: feature selection (correlation matrix) & feature engineering

  4. Modeling: linear regression, KNN, SVM and random forest

  5. Error analysis

Data cleaning

Transformation to factor:

fact_var <- c("ProductType","ProductNum")
# lapply() keeps each column as a factor; apply() would return a character matrix
ex_prod[fact_var] <- lapply(ex_prod[fact_var], as.factor)

Giving names to the rows:

ex_prod <- tibble::column_to_rownames(.data = ex_prod,
                                     var = "ProductNum")
# column_to_rownames() already drops ProductNum, so no further removal is needed

Data cleaning: detecting NAs in BestSellersRank with VIM

## 
##  Variables sorted by number of missings: 
##               Variable  Count
##        BestSellersRank 0.1875
##            ProductType 0.0000
##                  Price 0.0000
##          x5StarReviews 0.0000
##          x4StarReviews 0.0000
##          x3StarReviews 0.0000
##          x2StarReviews 0.0000
##          x1StarReviews 0.0000
##  PositiveServiceReview 0.0000
##  NegativeServiceReview 0.0000
##       Recommendproduct 0.0000
##         ShippingWeight 0.0000
##           ProductDepth 0.0000
##           ProductWidth 0.0000
##          ProductHeight 0.0000
##           ProfitMargin 0.0000
##                 Volume 0.0000
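
Note that the Count column above holds proportions, not counts. As a minimal sketch, the same per-variable share of missing values can be computed in base R. The toy data frame below is made up for illustration and is not the real product data.

```r
# Toy data standing in for ex_prod (hypothetical values)
toy <- data.frame(
  BestSellersRank = c(12, NA, 40, NA, 7, 3, NA, 19),
  Price           = c(99, 450, 1200, 35, 60, 700, 15, 250),
  Volume          = c(10, 20, 0, 300, 45, 8, 1200, 64)
)

# Share of missing values per column, sorted from most to least missing
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(na_share)

# Columns with any missing value are candidates for removal or imputation
cols_with_na <- names(na_share)[na_share > 0]
print(cols_with_na)
```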

1st data expl.: Blackwell business

1st data expl.: Volume distribution

1st modeling: linear regression

# train and test
train_id <- createDataPartition(y = ex_prod$Volume, p = 0.80, list = FALSE)
train <- ex_prod[train_id,]
test <- ex_prod[-train_id,]

# create linear regression model
mod_lm <- lm(formula = Volume ~ ., data = train)

metric  train  test
RMSE    0      0
R^2     100 %  100 %
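
The train and test metrics in the table above can be computed with a small helper. This is a sketch on the built-in mtcars data rather than on ex_prod, with a plain random split instead of createDataPartition. R^2 is taken as the squared correlation between predictions and observations, which is an assumption about how the reported figures were obtained.

```r
# Helper: RMSE and R^2 (squared correlation) for a vector of predictions
reg_metrics <- function(pred, obs) {
  c(RMSE = sqrt(mean((pred - obs)^2)),
    R2   = cor(pred, obs)^2)
}

set.seed(123)
# Simple 80/20 random split on a built-in data set (stand-in for ex_prod)
train_id <- sample(seq_len(nrow(mtcars)), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[train_id, ]
test  <- mtcars[-train_id, ]

mod <- lm(mpg ~ wt + hp, data = train)

# One column per data set, one row per metric
metrics <- cbind(
  train = reg_metrics(predict(mod, train), train$mpg),
  test  = reg_metrics(predict(mod, test),  test$mpg)
)
print(round(metrics, 2))
```

An RMSE of exactly 0 on both columns, as reported above, would mean every prediction matches its observation.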

Main predictors:

  1. 5 stars
  2. Product type: Game console

2nd pre-process: feature selection

2nd modeling: linear regression

metric  train  test
RMSE    0      0
R^2     100 %  100 %

Main predictors:

  1. 5 stars
  2. Product type: PC
  3. Price

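The model is overfitted again: a perfect fit on both train and test suggests that one predictor, most likely 5 stars, reproduces Volume almost exactly.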

3rd pre-process: outlier detection in the star variables
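
The slides do not say which rule was used to flag outliers. One common option, shown here purely as an assumption, is the 1.5 x IQR boxplot rule via boxplot.stats() from base R, applied to a made-up star-review column.

```r
# Hypothetical 4-star review counts; the last value is deliberately extreme
x4 <- c(0, 2, 3, 5, 8, 10, 12, 15, 26, 431)

# boxplot.stats() returns the values lying beyond 1.5 * IQR from the hinges
out_vals <- boxplot.stats(x4)$out
print(out_vals)

# Positions of the observations to inspect or drop
is_outlier <- x4 %in% out_vals
print(which(is_outlier))
```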

3rd pre-process: feature engineering

3rd pre-process: corr. matrix with total stars

3rd pre-process: corr. matrix with x4 and x2
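
A sketch of the correlation check on made-up data: a predictor that correlates almost perfectly with the target is a candidate for removal before modeling. The column names mirror the ones used later in the slides. The values, the exact relation between x5 and Vol, and the 0.95 threshold are assumptions chosen for illustration.

```r
# Made-up review counts; Vol is built as an exact multiple of x5 on purpose
toy <- data.frame(
  x5 = c(3, 10, 25, 40, 80, 120),
  x4 = c(4, 1, 20, 9, 50, 22),
  x2 = c(0, 1, 5, 2, 6, 3)
)
toy$Vol <- 4 * toy$x5

# Pairwise Pearson correlations
corr <- cor(toy)
print(round(corr, 2))

# Predictors whose absolute correlation with the target exceeds the threshold
threshold <- 0.95
predictors <- setdiff(colnames(corr), "Vol")
leaky <- predictors[abs(corr[predictors, "Vol"]) > threshold]
print(leaky)
```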

3rd modeling: linear regression

metric  train    test
RMSE    307.07   276.19
R^2     71.77 %  80.47 %

The model is no longer overfitted, but its performance is poor. Let's check where it is failing!

3rd error check: error visualization for linear regression

4th exploration: recommendation variable

4th pre-process: repeated observations

product_num  ProductType       Pos_Ser  Neg_Ser  Recomend  Vol
132          ExtendedWarranty  0        3        0.4       0
133          ExtendedWarranty  0        1        0.6       20
134          ExtendedWarranty  280      8        0.9       1232
135          ExtendedWarranty  280      8        0.9       1232
136          ExtendedWarranty  280      8        0.9       1232
137          ExtendedWarranty  280      8        0.9       1232
138          ExtendedWarranty  280      8        0.9       1232
139          ExtendedWarranty  280      8        0.9       1232
140          ExtendedWarranty  280      8        0.9       1232
141          ExtendedWarranty  280      8        0.9       1232
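
Products 134 to 141 are identical in every column shown. A minimal sketch of flagging such repeats with base R duplicated(), using the rows of the table above. Comparing on every column except the identifier, and keeping the first occurrence, are assumptions about how the repeats were handled.

```r
# The ten rows from the table above
warranty <- data.frame(
  product_num = 132:141,
  ProductType = "ExtendedWarranty",
  Pos_Ser     = c(0, 0, rep(280, 8)),
  Neg_Ser     = c(3, 1, rep(8, 8)),
  Recomend    = c(0.4, 0.6, rep(0.9, 8)),
  Vol         = c(0, 20, rep(1232, 8))
)

# Compare on every column except the identifier
compare_cols <- setdiff(names(warranty), "product_num")
is_repeat <- duplicated(warranty[, compare_cols])

# Keep the first occurrence of each distinct observation
warranty_unique <- warranty[!is_repeat, ]
print(warranty_unique)
```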

4th feature engineering: pos. and neg. service

4th modeling: linear regression

metric  train    test
RMSE    333.82   241.01
R^2     65.46 %  69.35 %

The test RMSE has improved slightly (276.19 to 241.01), although R^2 has dropped on both sets. Let's see how the model performs on the categories we are interested in.

4th error check: error visualization lm
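
One way to see how a model performs on the categories of interest is to summarise the absolute error per product type. The sketch below uses made-up observed and predicted volumes. The column names obs, pred and abs_err are hypothetical.

```r
# Made-up observed and predicted volumes per product (illustration only)
errs <- data.frame(
  ProductType = c("PC", "PC", "Laptop", "Laptop", "Netbook", "Smartphone"),
  obs         = c(100, 40, 200, 60, 80, 300),
  pred        = c(120, 30, 150, 90, 85, 270)
)
errs$abs_err <- abs(errs$pred - errs$obs)

# Mean absolute error per product type, largest first
mae_by_type <- aggregate(abs_err ~ ProductType, data = errs, FUN = mean)
mae_by_type <- mae_by_type[order(-mae_by_type$abs_err), ]
print(mae_by_type)
```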

5th modeling: KNN with caret

# defining variables to create the model 
rel_var <- c("x4","x2","Pos_Ser","Neg_Ser","Recomend","Vol","PC","Laptop",
             "Netb","Smart_Ph")

# cross validation
ctrl <- caret::trainControl(method = "repeatedcv",
                            number = 10,
                            repeats = 3)

# modeling
mod_5knn_caret <- caret::train(Vol ~ .,
                               method = "knn",
                               data = train[,rel_var],
                               trControl = ctrl, 
                               preProcess = c("center","scale"))

metric  train   test
RMSE    248.68  153.56
R^2     84.6 %  90.4 %
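
rel_var refers to indicator columns PC, Laptop, Netb and Smart_Ph that are not created in the code shown. A possible way to derive them from ProductType is sketched below. The spellings of the type labels are guesses at the factor levels, and the data frame prods is hypothetical.

```r
# Toy product types; the label spellings are assumptions
prods <- data.frame(
  ProductType = c("PC", "Laptop", "Netbook", "Smartphone", "Printer", "PC")
)

# One 0/1 indicator per product type of interest
prods$PC       <- as.integer(prods$ProductType == "PC")
prods$Laptop   <- as.integer(prods$ProductType == "Laptop")
prods$Netb     <- as.integer(prods$ProductType == "Netbook")
prods$Smart_Ph <- as.integer(prods$ProductType == "Smartphone")

print(prods)
```

Any other product type (Printer in the toy data) gets zeros in all four columns and so acts as the baseline.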

5th error check: error visualization knn

6th modeling: random forest

set.seed(123)
mod_6rf <- caret::train(Vol ~ .,
                        method = "rf",
                        data = train[,rel_var],
                        trControl = ctrl)

metric  train   test
RMSE    102.03  84.67
R^2     97 %    98.03 %

6th error check: error visualization rf

7th modeling: support vector machine

set.seed(123)
mod_7svm <- caret::train(Vol ~ .,
                         method = "svmLinear",
                         data = train[,rel_var],
                         trControl = ctrl)

metric  train    test
RMSE    373.13   334.86
R^2     61.16 %  44.33 %

7th error check: error visualization SVM
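
Collecting the test-set figures reported for models 4 to 7 in one table makes the comparison explicit. The numbers are copied from the slides. Only the data frame layout and the column names are new.

```r
# Test-set metrics as reported for models 4 to 7
results <- data.frame(
  model     = c("lm (4th)", "knn", "rf", "svmLinear"),
  rmse_test = c(241.01, 153.56, 84.67, 334.86),
  r2_test   = c(69.35, 90.40, 98.03, 44.33)
)

# Rank by test RMSE, lowest first
results <- results[order(results$rmse_test), ]
print(results)

best_model <- results$model[1]
print(best_model)
```

On these figures random forest has both the lowest test RMSE and the highest test R^2.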

Model application and results
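
A rough sketch of how the business question could be answered once volumes are predicted for the new products. It assumes that profit is predicted volume x price x profit margin, which is not stated in the slides. The products and all values below are made up. In the real pipeline pred_vol would come from the chosen model, for example via predict(mod_6rf, newdata = ...).

```r
# Made-up new products with hypothetical predicted volumes
new_prod <- data.frame(
  ProductNum   = 1:7,
  Price        = c(700, 900, 1200, 400, 300, 150, 2000),
  ProfitMargin = c(0.25, 0.20, 0.10, 0.10, 0.10, 0.05, 0.20),
  pred_vol     = c(400, 150, 200, 60, 1000, 500, 10)
)

# Assumed profit definition: units sold x price x margin
new_prod$profit <- new_prod$pred_vol * new_prod$Price * new_prod$ProfitMargin

# Top 5 products by expected profit
top5 <- head(new_prod[order(-new_prod$profit), ], 5)
print(top5[, c("ProductNum", "pred_vol", "profit")])
```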